This paper evaluates the suitability for onboard processing of
the new embedded Graphics Processing Unit (GPU) with 192
single-precision cores in Nvidia's Tegra K1 (K1) System-on-Chip
(SoC), which has a typical Thermal Design Power (TDP) under 7 W [1].
The performance of this SoC is compared to two modern High 
Performance Computing (HPC) architectures: 
(1) General-purpose multi-core CPU (8-core Sandy Bridge E5-2470, 2.3 GHz, TDP 95 W [2])
(2) GPU accelerator (Nvidia Tesla K20 (K20), TDP 225 W [3]).


For this study, we selected two algorithms: 


Wavelet Spectral Dimension Reduction of Hyperspectral Imagery

The principle of this method is to apply a discrete wavelet
transform to hyperspectral data in the spectral domain and at each
pixel location. The optimal level of wavelet decomposition is
computed adaptively for each pixel. See [4] for more details.
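
As a rough illustration of the per-pixel spectral transform described above, the following minimal sketch keeps only the low-pass (approximation) coefficients of a Haar wavelet along the spectral axis. The choice of the Haar wavelet, the fixed `levels` argument, the odd-length padding rule, and the function names `haar_approximation` and `reduce_cube` are all assumptions made for this example; the adaptive per-pixel level selection of [4] is not reproduced here.

```python
def haar_approximation(spectrum, levels):
    """Return the Haar approximation coefficients of one pixel spectrum.

    Each level halves the number of spectral samples. The wavelet choice and
    the fixed number of levels are assumptions of this sketch.
    """
    coeffs = list(spectrum)
    for _ in range(levels):
        if len(coeffs) < 2:
            break
        # Pad odd-length input by repeating the last sample (an assumption).
        if len(coeffs) % 2:
            coeffs.append(coeffs[-1])
        # Orthonormal Haar low-pass filter: (a + b) / sqrt(2) for each pair.
        coeffs = [(coeffs[i] + coeffs[i + 1]) / 2 ** 0.5
                  for i in range(0, len(coeffs), 2)]
    return coeffs


def reduce_cube(pixels, levels):
    """Apply the spectral reduction independently to every pixel spectrum."""
    return [haar_approximation(p, levels) for p in pixels]
```

Because every pixel is processed independently, a decomposition of this kind maps naturally to one GPU thread (or one SIMD lane) per pixel.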


Automated Cloud-Cover Assessment (ACCA) Algorithm 

The ACCA algorithm determines and rates the overall cloud cover
of an image in two steps: Pass-One isolates clouds from
non-clouds using eight threshold-based filters, then Pass-Two
resolves the detection ambiguities left by Pass-One by computing
global statistics over the image. See [5] for more details.
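
The two-pass structure can be sketched as follows. This is not the ACCA filter set of [5]: the three filters, their thresholds (`bright_min`, `cold_max`, `very_cold_max`), the `margin_k` rule in the second pass, and all function names are hypothetical and chosen only to show the control flow. The first pass is written with arithmetic on boolean masks instead of per-pixel `if`/`else`, which is one common way to make such filters SIMD-friendly.

```python
CLEAR, AMBIGUOUS, CLOUD = 0, 1, 2


def pass_one(reflectance, temperature_k,
             bright_min=0.08, cold_max=300.0, very_cold_max=280.0):
    """Label pixels with hypothetical threshold filters, without branching."""
    labels = []
    for r, t in zip(reflectance, temperature_k):
        bright = r > bright_min        # hypothetical brightness filter
        cold = t < cold_max            # hypothetical thermal filter
        very_cold = t < very_cold_max  # hypothetical "certain cloud" filter
        candidate = bright & cold
        # candidate -> AMBIGUOUS (1); candidate and very cold -> CLOUD (2).
        labels.append(int(candidate) + int(candidate & very_cold))
    return labels


def pass_two(labels, temperature_k, margin_k=15.0):
    """Resolve ambiguous pixels using a global statistic of certain clouds."""
    cloud_temps = [t for l, t in zip(labels, temperature_k) if l == CLOUD]
    if not cloud_temps:
        # Without certain clouds, ambiguous pixels are treated as clear
        # (an assumption of this sketch).
        return [CLEAR] * len(labels), 0.0
    # Global statistic: mean temperature of certain clouds plus a margin.
    threshold = sum(cloud_temps) / len(cloud_temps) + margin_k
    resolved = [
        CLOUD if l == CLOUD or (l == AMBIGUOUS and t <= threshold) else CLEAR
        for l, t in zip(labels, temperature_k)
    ]
    cloud_cover = resolved.count(CLOUD) / len(resolved)
    return resolved, cloud_cover
```

The second pass needs a reduction over the whole image before any ambiguous pixel can be resolved, so it cannot be fused into the per-pixel first pass.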


This paper shows that the performance achieved using this
new SoC designed for battery-powered devices is comparable
to that of HPC hardware with significantly higher power consumption.


To obtain optimal performance, we had to redesign the original
algorithms to support SIMD processing. Compared to the
performance of the high-end 8-core Intel Xeon server CPU, the
Tegra K1 achieved (1) 51% for the ACCA algorithm and (2) 20% for
the dimension reduction algorithm. Both algorithms use only the
GPU part of the SoC, leaving the 4+1 ARM Cortex A15
general-purpose cores available for other tasks.
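
To put these percentages next to the quoted power budgets, the short calculation below divides relative performance by relative TDP. It is only a back-of-the-envelope sketch: TDP is a design limit rather than measured power draw, the 7 W figure covers the SoC plus DRAM while the 95 W figure covers the CPU alone, and the function name `relative_perf_per_watt` is introduced here for illustration.

```python
def relative_perf_per_watt(relative_performance, tdp_watts, reference_tdp_watts):
    """Performance per TDP watt, relative to the reference chip (reference = 1.0)."""
    return relative_performance * reference_tdp_watts / tdp_watts


# Tegra K1 (7 W) versus the 8-core Xeon (95 W), using the figures quoted above.
acca = relative_perf_per_watt(0.51, 7.0, 95.0)     # about 6.9
wavelet = relative_perf_per_watt(0.20, 7.0, 95.0)  # about 2.7
print(f"ACCA: {acca:.1f}x, wavelet reduction: {wavelet:.1f}x")
```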
FOR EMBEDDED NVIDIA KEPLER GPU ARCHITECTURE


                     | Nvidia Tegra K1 (GPU part)    | 8-core Intel Sandy Bridge E5-2470 | Nvidia Tesla K20 GPU
                     |                               | (general-purpose CPU for HPC)     |
Frequency            |                               | 2.3 GHz                           | 0.706 GHz
Cache                | 128 KB L2 per chip            | 20 MB L3 per chip                 | 1536 KB L2 per chip
Mem. size; bandwidth | 2 GB on Jetson TK1; 14.9 GB/s | up to 384 GB; 38.4 GB/s           | 5 GB; 208 GB/s
TDP                  | 7 W (SoC + DRAM only)         | 95 W (CPU only)                   | 225 W (accelerator only)

Main parameters of the selected hardware architectures
Speedup achieved by vectorization for CPU is between 4.7 and 5.8. The processing time variation of the original algorithm for different input data is 15.2%. The values above the bars show the speedup for different scenes with various cloud coverage from 1% to 66%.

Processing time variation based on input data with various cloud coverage 1%, 13%, 26%, 37%, 55% and 66% for CPU. The values above the bars describe the difference in processing time: negative values mean slower than the vectorized algorithm without branching.

Processing time variation based on input data with various cloud coverage 1%, 13%, 26%, 37%, 55% and 66% for Tesla K20
and Tegra K1 GPUs. The values above the bars describe the difference in processing time: negative values mean slower than
vectorized no-branching algorithm. Image size is 2048x2048 pixels.

Optimal vector length for the ACCA algorithm running on CPU is 32.

Optimal number of threads per block for ACCA on Tesla K20 and Tegra K1 is 128.

Chip-to-chip performance comparison of the vectorized ACCA algorithm without branching for image size 2048x2048 pixels.

Chip-to-chip performance comparison of the Wavelet Spectral Dimension Reduction algorithm for image size 100,000 pixels.